perf(rust-kernel): Intern-assigned uids, symbolic Nat offsets, environment-machine WHNF reducer by samuelburnham · Pull Request #442 · argumentcomputer/ix

samuelburnham · 2026-06-12T16:19:50Z

This pull request implements Rust-kernel and ixon optimizations for the Zisk guest, measured against the base #411 as deterministic guest cycle counts (ziskemu -m) over dumped shard inputs (zisk/scripts/bench-cycles.sh), with every counted run's committed failures publics word verified zero.

Key Results

Int16.instRxcHasSize_eq: 5.70 B → 42.0 M guest steps (135.6×); Int32/Int64 variants confirmed at the same magnitude (~42 M) — in-circuit cost is now integer-width-independent
Cycle suite total: 17.53 B → 5.77 B steps (3.04×, −67.1%)
Environment-machine WHNF (final two kernel commits): −19.3% suite net; Vector.extract_append._proof_1 −32.4%; the Init Array/Vector Extract proof family −22% to −33% each (five multi-billion-step constants)
Std.Tactic.BVDecide.BVExpr.bitblast.goCache_Inv_of_Inv._mutual (28.35 B steps, the most expensive constant measured in Init+Std): previously crashed the 512 MB guest (OOM-class null read) → now completes as a valid kernel pass — lazy closures cut peak RAM, not just cycles
Kernel hardened against adversarial inputs: term identity moved off per-node content hashing with the threat model documented (docs/kernel_identity.md), blob bytes verified at load in all deserializers

input	base (#411)	this PR	Δ	speedup
_private.Init.Data.Range.Polymorphic.SInt.0.Int16.instRxcHasSize_eq	5,697,166,200	42,012,168	−99.3%	135.6×
_private.Init.Data.Vector.Extract.0.Vector.extract_append._proof_1	7,690,887,211	3,344,459,375	−56.5%	2.3×
List.mergesort synthetic env	4,005,320,347	2,290,310,593	−42.8%	1.7×
Nat.add_comm	18,690,815	11,532,048	−38.3%	1.6×
Array.binSearchAux._unary	95,800,411	62,862,965	−34.4%	1.5×
Nat.gcd_comm	6,669,821	5,550,752	−16.8%	1.2×
Batteries.RBMap	13,940,412	12,321,092	−11.6%	1.1×
String.append	4,560,304	4,070,450	−10.7%	1.1×
total	17,533,035,521	5,773,119,443	−67.1%	3.04×

Fifteen Commits (oldest first)

bench: guest cycle harness + native single-constant check
- bench-cycles.sh suite over dumped guest stdins with failures-word validation; check_one example mirrors the guest's reuse-mode check natively
kernel: keep symbolic Nat offsets compact (whnf stuck + offset def-eq)
- Nat.add x lit / Nat.div|mod x k stay stuck as compact offsets instead of materializing succ^n(x) chains or the division algorithm; bulk offset def-eq decides offset pairs directly
- The Int16.instRxcHasSize_eq collapse (5.70 B → 56 M) and the UTF-8-codec-class fix
kernel: intern-assigned uids replace per-node blake3 content hashing
- Term identity becomes intern-table-assigned sequential u64s; shallow structural keys for hash-consing; blake3 stays only at the Ixon boundary (content addresses, Merkle roots, proof-carrying-code claims)
- The broadest single win: −30–36% on every reduction-heavy input
kernel: memoized prim-family dispatch in the WHNF/def-eq reduction loops
- Constant heads classified once per address (Native/BitVec/Nat/Decidable/Str); at most one family recognizer runs per iteration instead of the five-probe gauntlet
ixon: defer per-constant address verification to first materialization
- Constants verify on first get(); blob bytes still verified eagerly at load
bench: add Int32/Int64 instRxcHasSize_eq to the cycle suite
docs: kernel uid identity vs Ixon content addressing
- Two-layer identity model and why uid collisions are not an attack surface (uids are assigned, never computed from input)
kernel: fix PrimFamily visibility/qualification warnings
kernel: port jcb/fixes H-15 whnf probe pre-filter + H-12 nat output caps
- Allocation-free transient-nat spine probe ahead of the cache lookup; caps on Nat primitive output sizes
kernel+ixon: harden term identity against adversarial inputs
- Blob bytes verified against their address at load in all three deserializers; structural equality audited against the adversarial model
kernel: privatize uid-accepting constructors; document cross-shard uid story
docs: design for the environment-machine WHNF port
- Full design for the Krivine-style machine (ported from the proven IxVM implementation), including the exit-to-outer-loop contract and the known regression class
kernel: environment-machine WHNF (Phase A — lazy substitution on the beta path)
- whnf_core's App arm enters a closure machine when a beta fires: beta/zeta are O(1) environment pushes; substitution materializes only at machine exits (clo_subst readback), so work is proportional to what the reduction consumes
- Every exit re-enters the existing loop: ambient zeta, iota, prim dispatch, cheap-mode def-eq flags, and all caches stay byte-identical
kernel: closure-iota at the machine's recursor exit (Phase B)
- Recursor spines consumed lazily: only the major premise materializes; params/motives/minors and post-major args ride through as closures, so unselected minors (dropped match/Decidable branches) are never substituted and never read back
- K recursors, literal majors, and struct-eta deliberately miss to the unchanged plain path, preserving the Nat transient-work and linear-rec/offset shortcuts exactly
docs: env-machine WHNF design is implemented (Phases A+B)

zisk/scripts/bench-cycles.sh runs the standard dumped-input suite (~/benchdata/zisk-inputs) through ziskemu, records steps, and verifies the committed `failures` publics word is zero. check_one mirrors the Zisk guest's one-work-item path natively so IX_* instrumentation can target a single expensive constant.

Port of IxVM dbc4177/5dcab7f/f7cfe23 to the Rust kernel, three coordinated changes: - whnf (try_nat_offset_stuck): leave `Nat.add base (Lit n)` (symbolic base, n>0) and `Nat.div/mod base (Lit k)` (k>=2) STUCK in compact form instead of delta-unfolding into succ^n towers / the division algorithm. Iota over such majors already works via cleanup_nat_offset_major. - linear-rec: collapse `succ^o(Nat.rec base succ-step (Lit n))` with a SYMBOLIC base to `Nat.add base (Lit n+o)` instead of declining into n iota steps. - def-eq (try_def_eq_offset): decompose both sides to base + offset (literals, succ layers, add-lit read in O(1)) and strip the shared offset in ONE step. The previous one-succ-peel recursed through full is_def_eq per layer, which blew MAX_DEF_EQ_DEPTH=2000 on deep offsets. Guest cycles (ziskemu, vs d2ea3d9 baseline): Int16.instRxcHasSize_eq 5,697,166,200 -> 56,028,269 (-99.0%) binsearch 95,800,411 -> 87,735,518 (-8.4%) Nat.add_comm 18,690,815 -> 17,805,210 (-4.7%) mergesort shard 4,005,320,347 -> 3,920,521,476 (-2.1%) Vector.extract_append._proof_1 unchanged (+0.02%), rest ~0. 604 kernel tests pass (one test updated: whnf of `x + 2` now asserts the compact stuck form; new test covers tower-vs-compact def-eq).

Every KExpr/KUniv construction used to compute a blake3 hash of its content for identity (intern key, cache keys, equality) — ~20% of guest cycles on reduction-heavy constants (app_hash 22-33% cumulative). Identity is now an intern-assigned u64 uid: - uids come from a process-global atomic counter and are NEVER reused, so uid equality implies structural equality and cache keys cannot alias across intern-table clears (a stale key can only miss). - InternTable keys on shallow structural keys (ExprKey/UnivKey): variant tag + child uids + semantic payload, mirroring exactly what the old content hash covered (display names, binder info, mdata excluded). intern_expr/intern_univ canonicalize children recursively when handed nodes built outside the table, preserving the old "structural identity" interning semantics. - PartialEq for KExpr/KUniv is now structural with the uid fast path (canonical-vs-canonical compares O(1)); hash_eq remains the fast-but-incomplete uid check for cache/quick-path callers where a false negative only costs a fallback. - Cache ctx components stay blake3 (CtxAddr) but hash uids instead of content addrs; push_local/push_let intern their frame types so equal context suffixes share cache entries. - ingress's hash-first intern probe becomes key-first (same skip-the- construction win, no hash). Soundness note: affirmative identity (equal uid => equal term) is what def-eq quick paths and caches rely on, and it holds by construction; incomplete negatives only cost cache misses. Nat/Str literals keep their blake3 blob Addresses (ingress-meaningful). Guest cycles (ziskemu, vs previous commit): mergesort shard 3,920,521,476 -> 2,526,226,208 (-35.6%) Vector.extract_..._1 7,692,517,694 -> 5,369,879,578 (-30.2%) Nat.add_comm 17,805,210 -> 12,687,822 (-28.7%) binsearch 87,735,518 -> 64,562,040 (-26.4%) Int16.instRxc..._eq 56,028,269 -> 45,024,529 (-19.6%) natgcdcomm/stringappend/rbmap: -11.2% / -6.9% / -2.7% 604 kernel tests pass (16 hash-determinism tests updated to assert the preserved property — identity ignores names/bi/mdata — via structural equality, shallow-key equality, or shared-table interning).

The WHNF loops (full + no-delta) and the def-eq lazy-delta loop probed all five primitive recognizers (native/bitvec/nat/decidable/string) every iteration; each collects its own app spine and runs its own gauntlet of 32-byte address compares — pure tax on ordinary constant-headed terms. Classify the head once per iteration via an allocation-free app-chain walk + per-address memo (KEnv::prim_family_cache), and call at most the one matching family reducer. Family head-address sets are disjoint, so dispatch order semantics are unchanged. The nat-offset-stuck probe gates on the Nat family too. Mirrors the IxVM prim_family dispatch memo (ix-aiur, measured there as part of -28% on Vector._proof_2). Guest cycles (ziskemu, vs previous commit): mergesort shard 2,526,226,208 -> 2,399,597,758 (-5.0%) Vector.extract_..._1 5,369,879,578 -> 5,112,313,900 (-4.8%) Nat.add_comm 12,687,822 -> 12,249,578 (-3.5%) Int16.instRxc..._eq 45,024,529 -> 43,835,207 (-2.6%) binsearch 64,562,040 -> 64,095,836 (-0.7%) others ~0 (+25 steps cache-population noise on two inputs) 604 kernel tests pass.

Env::get_anon paid one blake3 content-hash per constant at parse time. The binding check moves into LazyConstant::get() (pending_addr field): constants shipped in a closure but never forced by the typechecker are never hashed, while everything the kernel certifies is still verified on materialization — checking a constant forces it. The compile-side from_constant path (pre-populated cache) is exempt as before. Guest cycles: rbmap -9.5%, natgcdcomm -6.4%, stringappend -4.2%, Int16 -3.7%, binsearch -1.6%, nataddcomm -0.3%; mergesort/vector +0.01% (re-verification of the few constants materialized more than once). 200 ixon + 604 kernel tests pass (tampering test updated to assert rejection at get() instead of at parse).

Both complete at ~42M steps, matching Int16 — the nat-offset compaction makes the cost range-independent (pre-fix, Int16's 2^16 range alone was 5.7B steps and Int32/Int64 would not have completed).

Records the two-identity-layer boundary after 1e3029d: constants/blobs keep their blake3 Ixon Addresses (claims, Merkle roots, proof store, cross-run reuse, and per-constant inclusion proofs are all unchanged), while the removed blake3 hashing was the kernel-internal per-node scheme used only for intern/cache identity inside one checker run — never serialized, never part of a proof artifact. Documents the soundness argument (equal uid => equal term, never reused), the boundary rule (uids must not escape into serialized artifacts), and what a future sub-constant commitment feature would need to recompute at the egress boundary.

PrimFamily is referenced by the public KEnv::prim_family_cache field, so it must be pub; drop a redundant path qualification.

H-15 (perf): the transient-nat probes run on every whnf/whnf_core call BEFORE the cache lookup; they paid two collect_app_spine Vec allocations plus a KConst::Recr clone per call on the overwhelmingly common non-Nat-recursor path. Pre-filter with an allocation-free spine_head_and_len; spine collection + Recr lookup now happen only for actual Nat-recursor heads. Also drop the redundant rules.clone() in the iota IotaInfo build (rules is already owned from the try_get_const clone). H-12 (hardening): output-size cap (2^26 bits) on natively computed shiftLeft/pow results — a ~12-byte adversarial term could otherwise demand a multi-GiB allocation (cycle explosion in-guest). Oversized results stay stuck, like the existing pow exponent guard. Ported from jcb/fixes 3763356 (pre-workspace layout), adapted to kernel-riscv. Guest cycles (ziskemu, vs previous commit): mergesort shard 2,399,808,204 -> 2,313,445,541 (-3.6%) Vector.extract_..._1 5,111,066,294 -> 4,949,833,244 (-3.2%) Nat.add_comm 12,217,888 -> 11,798,013 (-3.4%) Int16/32/64 rxc ~42.2M -> ~41.3M (-2.1%) binsearch 63,115,445 -> 62,640,316 (-0.75%) small inputs ~0. 604 kernel tests pass.

Closes the adversarial-binding review of the uid-interning design: - ixon: blob bytes are blake3-verified against their address at load in all three env deserializers (ports jcb/fixes H-1). This is the one place adversarial CONTENT enters kernel identity: ExprKey::Nat/Str and literal structural equality key on the blob Address (mirroring the old content hash), which is sound only if address binds bytes. Costs +0.06..+0.29% guest cycles (blobs are a small fraction of env bytes; constants stay lazily verified). - kernel: fresh_uid aborts on counter exhaustion instead of wrapping (>2^67 guest cycles to reach; identity soundness should not rest on "probably won't happen"). - kernel: literal structural-equality arms compare values as well as blob addresses (defense in depth — redundant under the load-time check, degrades to inequality if that invariant were ever violated). - kernel: intern hits assert structural equality in debug builds. - docs/kernel_identity.md: adversarial-model section — why uid identity cannot be cache-poisoned (uids are assigned by a sequential counter, never computed from content, so there is no hash function to collide; intern keys are exact structural Eq, never digests; cache hits require uid equality; map-bucket collisions cost a key compare, not a wrong hit). 604 kernel + 200 ixon tests pass.

…d story The *_mdata_with_addr constructors were the only API through which a caller-supplied (hence potentially duplicated) uid could enter a node; nothing outside expr.rs uses them since the ingress key-first rework, so make them private — uid uniqueness per process is now enforced by visibility, not convention. Docs: spell out the within-shard (impossible by assignment) vs between-shards (certain and harmless; uids never cross the process boundary) collision analysis.

Krivine-style machine for the structural WHNF fragment, ported from the proven IxVM implementation (ix-aiur d0d3375 Phase A hybrid entry, b4c4ea3 Phase B closure-iota; measured there Int16 -33.8%, Vector -17.1%, mergeSort -6.8%). Covers the Rust-specific mapping the IxVM side doesn't have: cheap_proj/cheap_rec threading, fuel retention (adversarial posture), ambient let/LDecl zeta via the exit-to-outer- loop contract, iterative loop for stack discipline, clo_subst readback with lbr guards + per-call scratch, unchanged cache/uid semantics. Phased A (beta/zeta) then B (closure-iota, the UTF-8-class payoff), with the known curried-sharing regression class added to the bench suite before implementation.

…beta path) Port of the IxVM environment machine (~/ix-aiur d0d3375; design in docs/env_machine_whnf.md). whnf_core's App arm enters a Krivine-style machine when a beta fires: beta/zeta become O(1) environment pushes onto a persistent cons-list of closures, and substitution materializes only at machine exits (clo_subst readback) — work proportional to what the reduction consumes, not to every beta performed. A beta chain ending in another beta never materializes its intermediate bodies. - subst.rs: Clo/MEnv closure types (carried env length; per-Clo OnceCell readback memo standing in for Aiur's global memoization), clo_subst/clo_readback with lbr guards, per-call (uid, depth) memo from a new InternTable scratch pool, results interned like subst. - whnf.rs: iterative machine_whnf loop + machine_exit, replacing the multi-lambda peel + simul_subst block. Fuel charged on beta/zeta at the eager path's one-substitution-event granularity. Every exit reads back and re-enters the outer loop, so ambient let/LDecl zeta, iota, prim dispatch, and cheap_proj/cheap_rec def-eq discipline are unchanged; all caches and keys untouched. - bench-cycles.sh: add natreclinear/natfoldsucc (foldAdd-class sentinels for the known curried-sharing regression class). 609 kernel + 200 ixon tests pass; native check_one true on Nat.add_comm, Nat.gcd_comm, Int16.instRxcHasSize_eq, natRecLinearCheck, Nat.fold_succ; perf_check whole-env failure parity with HEAD on 11 envs. Guest cycle benchmarks pending (box busy).

Port of IxVM try_iota_c (~/ix-aiur 097e9d5). The machine's Const exit no longer reads its spine back when the head is a recursor on the main ctor-rule path: only the major premise materializes (readback + the same full whnf the eager path uses), and on a hit the rule RHS re-enters the machine with params/motives/minors and post-major arguments as their ORIGINAL closures plus the ctor fields wrapped closed. The rule's Lam-chain betas push them straight into an environment, so unselected minors — dropped match/Decidable branches, the UTF-8 codec class — are never substituted and never read back. Off-main-path cases miss to the plain readback exit by construction, preserving semantics exactly: K recursors (ctor synthesis needs infer + def-eq), Nat literal majors (transient-work cache discipline and the linear-rec/offset shortcuts stay on the plain path), Str literal majors, struct-eta candidates, and stuck majors. cheap_rec mode skips machine-iota entirely, mirroring the eager gating. A miss costs one readback of the major; the plain path's whnf of it then hits a warm cache. Machine-native delta (IxVM Phase C1.5) does NOT port at this layer: their machine lives in a whnf that includes delta, while ours sits inside whnf_core, which must stay delta-free for def-eq's lazy-delta unfold ordering. Spanning the whnf delta loop with closures is a separate, larger restructure. 610 kernel + 200 ixon tests pass (new closure-iota-via-beta unit test covers the spine arithmetic incl. post-major shift); native check_one true on the usual five constants; perf_check whole-env failure parity on 11 envs. Guest cycle benchmarks pending alongside Phase A's.

map_or_else over map().unwrap_or_else(), pass ExprKey by reference in timed_intern_or_build, explicit clone for spec_params dedup snapshot.

samuelburnham added 17 commits June 12, 2026 16:32

bench: add Int32/Int64 instRxcHasSize_eq to the cycle suite

2ed9b04

Both complete at ~42M steps, matching Int16 — the nat-offset compaction makes the cost range-independent (pre-fix, Int16's 2^16 range alone was 5.7B steps and Int32/Int64 would not have completed).

kernel: fix PrimFamily visibility/qualification warnings

1781268

PrimFamily is referenced by the public KEnv::prim_family_cache field, so it must be pub; drop a redundant path qualification.

docs: env-machine WHNF design is implemented (Phases A+B)

4e893e6

chore: apply cargo fmt

96cd080

chore: fix clippy warnings (workspace, all targets)

4d2452d

map_or_else over map().unwrap_or_else(), pass ExprKey by reference in timed_intern_or_build, explicit clone for spec_params dedup snapshot.

samuelburnham force-pushed the sb/kernel-perf branch from fbf4c7a to 4d2452d Compare June 12, 2026 16:41

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(rust-kernel): Intern-assigned uids, symbolic Nat offsets, environment-machine WHNF reducer#442

perf(rust-kernel): Intern-assigned uids, symbolic Nat offsets, environment-machine WHNF reducer#442
samuelburnham wants to merge 17 commits into
kernel-riscvfrom
sb/kernel-perf

samuelburnham commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

samuelburnham commented Jun 12, 2026

Key Results

Fifteen Commits (oldest first)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant